17/3/2017

INEL

  • Grammars, corpora and language technologies for Indigenous Northern Eurasian Languages (website)
  • A long-term project (18 years) funded by the Academy of Sciences and Humanities in Hamburg
  • The project was applied for by Prof. Dr. Beáta Wagner-Nagy, Dr. Michael Riessler and the management of HZSK
  • Research:
    • Institut für Finnougristik / Uralistik at Universität Hamburg (IFUU)
  • Infrastructure:
    • Hamburger Zentrum für Sprachkorpora (HZSK)

Data first phase

  • Kamas
    • Kamassisches Wörterbuch (Donner, 1944), glossing complete
    • Audio from last speaker, currently being transcribed
  • Selkup
    • A.I. Kuzmina's archive at IFUU: field notes, recordings…
  • Dolgan
    • Texts from existing publications (folklore)
    • Audio recordings from different sources

Peculiarities of the Dolgan case

  • Variety of transcription systems
    • Transliterated to INEL conventions
  • Some printed publications rather new
    • High print quality, easy OCR
  • Sentence-level translation to Russian
    • Need for aligning
    • Russian translations often already available online
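
The need for aligning sentences with their Russian translations can be checked mechanically once an aligner has produced output. The following is a minimal sketch, assuming the three-column, tab-separated ladder output of hunalign (source index, target index, score). The score threshold and all function names are illustrative assumptions, not part of the INEL workflow.

```python
from typing import List, NamedTuple


class Rung(NamedTuple):
    source_index: int
    target_index: int
    score: float


def parse_ladder(text: str) -> List[Rung]:
    """Parse ladder lines of the form 'source<TAB>target<TAB>score'."""
    rungs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        src, tgt, score = line.split("\t")[:3]
        rungs.append(Rung(int(src), int(tgt), float(score)))
    return rungs


def suspicious_rungs(rungs: List[Rung], threshold: float = 0.0) -> List[Rung]:
    """Return rungs that start a segment worth checking by hand.

    A segment is flagged when it is not a plain 1:1 sentence pair
    or when its score falls below the (assumed) threshold.
    """
    flagged = []
    for current, following in zip(rungs, rungs[1:]):
        src_span = following.source_index - current.source_index
        tgt_span = following.target_index - current.target_index
        if (src_span, tgt_span) != (1, 1) or current.score < threshold:
            flagged.append(current)
    return flagged


# Invented ladder data, for illustration only.
example = "0\t0\t0.5\n1\t1\t0.3\n2\t3\t-0.2\n3\t4\t1.1\n"
for rung in suspicious_rungs(parse_ladder(example)):
    print(rung)
```

Only the flagged segments then need manual inspection, which keeps checking fast when most of the text aligns one sentence to one sentence.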

Basics of INEL OCR workflow

Tools used

  • We have been using ABBYY FineReader
  • First goal is to bring texts into FLEx with metadata
    • Toolbox file as interchange format
  • Audio is aligned later in EXMARaLDA
  • Git is used as version control across work phases
    • Most relevant in EXMARaLDA stage
  • Hunalign has been practical in alignment checking
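
To illustrate the Toolbox interchange step, here is a minimal sketch that writes sentence records as backslash-marked fields separated by blank lines. The markers `ref`, `tx` and `ft` and the record identifier are assumptions chosen for the example; the markers actually used in the INEL files are not given here.

```python
from typing import Dict, Iterable


def toolbox_record(fields: Dict[str, str]) -> str:
    """Format one record as lines of the form '\\marker value'."""
    lines = []
    for marker, value in fields.items():
        # Keep each field on a single line by collapsing whitespace.
        clean = " ".join(value.split())
        lines.append(f"\\{marker} {clean}".rstrip())
    return "\n".join(lines)


def toolbox_file(records: Iterable[Dict[str, str]]) -> str:
    """Join records with blank lines, as in Toolbox-style text files."""
    return "\n\n".join(toolbox_record(r) for r in records) + "\n"


# Placeholder content; marker names are hypothetical.
records = [
    {"ref": "Text01.001", "tx": "first sentence", "ft": "first translation"},
    {"ref": "Text01.002", "tx": "second sentence", "ft": "second translation"},
]
print(toolbox_file(records))
```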

Overview of the current workflow

Paradox with ABBYY

Desktop version

  • Good user interface
  • Training new models easy, although not transparent
  • Practical to do fast post-correction after OCR
  • No XML export

Engine version

  • Used from command line
  • Only pre-defined models
  • Cannot be post-corrected in ABBYY Desktop
  • Good ALTO XML export
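
As an illustration of what the ALTO export offers, the sketch below reads word text and page coordinates from the `String` elements of an ALTO document, using the standard attributes `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT`. The sample XML is invented, and the namespace is ignored on the assumption that different exports may declare different ALTO versions.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    hpos: float
    vpos: float
    width: float
    height: float


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def read_alto_words(xml_text: str) -> List[Word]:
    """Collect every String element with its position on the page."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append(Word(
                element.get("CONTENT", ""),
                float(element.get("HPOS", 0)),
                float(element.get("VPOS", 0)),
                float(element.get("WIDTH", 0)),
                float(element.get("HEIGHT", 0)),
            ))
    return words


# Invented sample, reduced to the elements needed for the example.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="first" HPOS="100" VPOS="200" WIDTH="50" HEIGHT="20"/>
    <SP WIDTH="10"/>
    <String CONTENT="word" HPOS="160" VPOS="200" WIDTH="45" HEIGHT="20"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

for word in read_alto_words(SAMPLE):
    print(word)
```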

How is this possible?

  • In defence of ABBYY, other OCR tools suffer from the same problem
  • Maybe it simply is difficult to manage user edits and word/letter position coordinates?

Currently lost information

  • Coordinate information on page
  • Some formatting
  • Footnotes are later added manually as notes
  • Time is wasted!

Why is this a problem? (1/2)

  • Reconstructing paragraphs from plain text very unreliable
    • Word coordinates on the page give more cues
  • Distinguishing different numberings from one another hard
    • Page numbers, chapter numbers etc.
  • Accurate page information nice for citations
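
The point about coordinates giving more cues for paragraphs can be sketched as follows. Each text line carries its left edge, top edge and height, and a new paragraph starts when the line is indented or follows a large vertical gap. The thresholds are assumptions that would need tuning per publication, and the sample lines are invented.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    text: str
    left: float
    top: float
    height: float


def group_paragraphs(lines: List[Line],
                     indent_tolerance: float = 15.0,
                     gap_factor: float = 1.8) -> List[str]:
    """Group lines (in reading order) into paragraphs using layout cues."""
    if not lines:
        return []
    # The smallest left edge is taken as the page's text margin.
    margin = min(line.left for line in lines)
    paragraphs = [[lines[0].text]]
    for previous, current in zip(lines, lines[1:]):
        indented = current.left - margin > indent_tolerance
        large_gap = current.top - previous.top > gap_factor * previous.height
        if indented or large_gap:
            paragraphs.append([current.text])
        else:
            paragraphs[-1].append(current.text)
    return [" ".join(parts) for parts in paragraphs]


page = [
    Line("First paragraph starts", 130, 100, 20),
    Line("and continues here.", 100, 125, 20),
    Line("Second paragraph starts", 130, 150, 20),
    Line("and ends.", 100, 175, 20),
    Line("After a vertical gap.", 100, 240, 20),
]
for paragraph in group_paragraphs(page):
    print(paragraph)
```

With plain text alone, neither the indentation nor the vertical gap is available, which is why paragraph reconstruction becomes guesswork.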

Why is this a problem? (2/2)

  • No chance to deal with more complex layouts in the document
  • Hard to make nice ebooks automatically!
    • Ebooks with broken paragraph structure annoying
  • Hard to make nice digital facsimiles!
    • Relevant for older and rarer items
  • Unnecessary information loss is never desirable!

Future tasks

Combining text and coordinates

  • Text files in some collections are nicely corrected
  • XML files contain coordinate info
    • It should be possible to match these two files
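
One possible way to approach this matching, sketched under the assumption that both files can be reduced to word sequences: align the uncorrected OCR words (with coordinates) to the corrected words using Python's `difflib.SequenceMatcher`, and copy a bounding box wherever the alignment is one-to-one. The function name and the sample data are illustrative; this has not been tested on INEL material.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # hpos, vpos, width, height


def transfer_coordinates(
    ocr_words: List[Tuple[str, Box]],
    corrected: List[str],
) -> List[Tuple[str, Optional[Box]]]:
    """Attach OCR bounding boxes to corrected words where they line up."""
    ocr_tokens = [token for token, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_tokens, b=corrected, autojunk=False)
    result: List[Tuple[str, Optional[Box]]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            continue  # OCR word has no counterpart in the corrected text
        same_length = (i2 - i1) == (j2 - j1)
        for offset, token in enumerate(corrected[j1:j2]):
            if tag in ("equal", "replace") and same_length:
                box = ocr_words[i1 + offset][1]
            else:
                box = None  # inserted or unevenly replaced: leave open
            result.append((token, box))
    return result


# Invented example: one OCR error corrected, one word added by the editor.
ocr = [("Tbe", (10, 10, 30, 12)), ("old", (45, 10, 25, 12)),
       ("story", (75, 10, 40, 12))]
fixed = ["The", "old", "story", "ends"]
for token, box in transfer_coordinates(ocr, fixed):
    print(token, box)
```

Words left without a box mark the places where the corrected text diverges too far from the OCR output and a manual check is needed.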

Combining corpora for research purposes

  • Same speakers and writers (Albert Vanejev, Jevgeni Igušev)







Аттьӧ! Thank you! Спасибо!


CC-BY – Niko Partanen / INEL – 2017


More information and publication here.